GlycoMME, a Markov modeling platform for studying N-glycosylation biosynthesis from glycomics data

Summary Variations in N-glycosylation, which is crucial to glycoprotein functions, impact many diseases and the safety and efficacy of biotherapeutic drugs. Here, we present a protocol for using GlycoMME (Glycosylation Markov Model Evaluator) to study N-glycosylation biosynthesis from glycomics data. We describe steps for annotating glycomics data and quantifying perturbations to N-glycan biosynthesis with interpretable models. We then detail procedures to predict the impact of mutations in disease or potential glycoengineering strategies in drug development. For complete details on the use and execution of this protocol, please refer to Liang et al. (2020).1


SUMMARY
Variations in N-glycosylation, which is crucial to glycoprotein functions, impact many diseases and the safety and efficacy of biotherapeutic drugs. Here, we present a protocol for using GlycoMME (Glycosylation Markov Model Evaluator) to study N-glycosylation biosynthesis from glycomics data. We describe steps for annotating glycomics data and quantifying perturbations to N-glycan biosynthesis with interpretable models. We then detail procedures to predict the impact of mutations in disease or potential glycoengineering strategies in drug development. For complete details on the use and execution of this protocol, please refer to Liang et al. (2020). 1

BEFORE YOU BEGIN Overview
Timing: variable (<45 min with preprocessed data; <5 min if MATLAB and relevant toolboxes have been pre-installed). Additionally, undetermined amounts of time will be needed for data curation depending on the total numbers of glycoforms or signals in users' glycomic datasets, the curators' familiarity with LinearCode, and any additional steps to normalize and organize glycomics data into the required matrix form as shown in the Data.xlsx file.
Glycans coat most cells and decorate most proteins a cell uses to interact with its environment. [2][3][4] Importantly, these glycans frequently modulate protein-protein interactions and participate in self-non-self recognition. 5,6 Small changes in glycan structure can result in major impacts on protein function and organismal phenotypes. 5,7 Thus, there is particular interest in understanding the genetic basis of any changes in glycosylation. Furthermore, the modulation of glycosylation on therapeutic proteins can increase potency and safety, and biosimilar development requires the achievement of glycans that are equivalent to the innovator drug. [8][9][10][11] However, glycans are complex and their control is challenging since they are synthesized in complex pathways. Thus, there is a need to develop computational models of glycosylation to enable the use of systems glycobiology to control glycan structures. 12 Many types of models have been constructed for glycosylation. [13][14][15][16][17][18][19] While many require the enumeration of large numbers of kinetic parameters, which are difficult to obtain, more recently a different approach has been taken wherein glycan biosynthesis is modeled as a Markov process, 1,16,20 wherein only a glycoprofile is needed for a sample of interest to parameterize a detailed and comprehensive model of glycosylation. Such models show great potential in being used to guide the glycoengineering of cells to obtain desired glycoprofiles on therapeutic proteins and biosimilars. 16 They could further provide insights into the genetic basis of changes in glycosylation seen in diverse diseases. Here we present a protocol describing GlycoMME, a modeling toolbox that makes use of the Markov modeling principles.
This section includes minimal hardware requirements, installation process, and the files required for the pipeline.
Download GlycoMME toolkit 1. Download GlycoMME toolkit from https://github.com/LewisLabUCSD/N-Glycosylation-Markov-Models ( Figure 1). Click the green ''Code'' button at the upper right corner and download the toolkit and the example dataset as a zip file by clicking ''Download Zip''. 2. Unzip everything into the same master folder (default name as N-Glycosylation-Markov-Modelsmain).  4. In MATLAB, click the icon in the red square ( Figure 2A). 5. Navigate to the master folder (default name:N-Glycosylation-Markov-Models-main) and click Select Folder ( Figure 2B).

Data collection
Note: Glycomics data can be collected in different forms and some third-party glycomics data may have incomplete information regarding the glycoforms. For ease of manipulation, glycomics data should be organized in the provided excel sheet Data_user.xlsx (listed in key resources table). An example properly formatted dataset (Data.xlsx) is provided in the Data folder for reference. 6. Raw glycomic data need to be pre-processed before running the pipeline. Depending on the data types, two scenarios are possible:   ii. Similar to step 6a, vii, enter in the second column all the glycan annotations as strings of Linear Code. Each glycan annotation should be unique. iii. For each element of a column (excluding the sample name), enter the relative abundance for each specific glycan structure annotation (row) in each glycomic profile of a sample (column). Otherwise enter 0.
Note: Starting from the third column, each column represents a sample and the top cell of each column is a sample name. The sample names should only contain alphanumeric characters and forward slashes. For glycosyltransferase knockout predictions, the sample name must be ''WT'' for the base or wildtype glycomic profile, and symbols of knocked-out glycosyltransferases separated by forward slashes.

OPEN ACCESS
iv. Open and run the script Sup1_Get_compositions_from_linearcodes.m in the master folder to automatically populate the datasheet Data_user.xlsx.
Once completing data pre-processing, users may compare the format of Data_user.xlsx with that of Data.xlsx as a quality check.

STEP-BY-STEP METHOD DETAILS
Here, the described step-by-step methods are for two purposes as demonstrated in Liang et al. 1 First, a generic Markov model is fitted to the glycomic profiles (glycoprofiles) of a glycoprotein drug produced by different glycoengineered cell lines. By visualizing the biosynthetic models of N-glycosylation, users gain insights into the glycoengineering impact on theoretical glycosylation reactions. Second, one may use the fitted models of single-glycosyltransferase-knockout (single-GT-KO) cell lines to predict the theoretical glycoprofiles of a glycoprotein produced by the cell lines with combinatorial GT knockouts. The glycoprotein whose glycoprofiles are predicted can be a different protein from the glycoprotein whose single-GT-KO glycoprofiles are used for fitting. As examples, we generated sample-specific models for 7 glycoprofiles of glycoengineered erythropoietin (EPO) produced by different single-GT-KO CHO-K1 or wildtype cell lines. 22 We then use a few fitted single-GT-KO EPO models to predict 2 glycoprofiles of EPO produced by the cell lines with combinatorial glycosyltransferase (GT) knockouts. Lastly, we use a few fitted single-GT-KO EPO models to predict 2 glycoprofiles of Enbrel produced by similar CHO cell lines with the knockout impact learned from the EPO models.

Visualize and preprocess dataset
Timing: 5 min This step prepares the data in the Data_user.xlsx for usage by MATLAB and visualizes the user-supplied experimental glycoprofiles (glycomic data) and glycan structure annotations (if available). This step can also be used to check the quality of entered data. In the script editor, specify the name of the dataset to be used at line 5.
Note: ''Data.xlsx'' is the name for the demonstration dataset and ''Data_user.xlsx'' is the name for the user-supplied dataset.
3. Run the entire script. A figure of the relative glycomic signals and a figure of glycan annotations will be generated for each sample, such as shown in Figure 3.
Note: for advanced users, rendering selected glycan annotations only can be achieved by modifying the inputs of function visualizeExpData according to the comments in the script.
4. The pre-processed data are stored in the structure variable DataSet and as MATLAB-readable file Data.mat in the Data folder.
Note: for advanced users, information regarding the created variables for pre-processed data can be found as comments in the script.

Construct a generic N-glycosylation Markov model
Timing: 10 min This step will generate a generic N-glycosylation Markov model, which will be later used to generate sample-specific models by fitting to the experimental glycoprofiles.

Open the MATLAB script
Step2A_Create_Generic_Glycan_Network_Fitting.m. 6. In the script editor, specify the name of the dataset to be used in line 4.
Note: ''Data.xlsx'' is the name for the demonstration dataset and ''Data_user.xlsx'' is the name for the user-supplied dataset.
7. Run the entire script.
Note: for advanced users, the scope of the reactions considered can be achieved by modifying the variable RxnSel and the inputs of function CreateGlycanRxnList according to the comments in the script. For practicality, the complexity level of the network should not exceed 23. Complexity level is defined as the maximal number of times a core Man9 glycan is modified by glycosidases and/or glycosyltransferases in a stepwise fashion.
8. The generic model and fitting parameters are stored in the structure variable GenericNetwork and as MATLAB-readable file GenericNetwork.mat in the Data folder. This step generates sample-specific N-glycosylation Markov models by fitting the generic model with their corresponding experimental glycoprofiles. This is a time-intensive step, and multiple MATLAB sessions can run in parallel to expedite the process.  Note: The name(s) provided here should be the same as the names appearing in the variable DataSet.ProfNames. To fit all the profiles in the dataset, users can instead specify the variable ProfSel as: 12. Specify the number of models to be fitted for the wildtype sample (base glycoprofile) in Line 25 and for each of the other samples in Line 23. For Example: Note: this is the number of models fitted for each profile per MATLAB session. In this case, 3 MATLAB sessions were run in parallel, with a total of 5 3 3 = 15 models fitted for each single-GT-KO profile and 10 3 3 = 30 models fitted for the wildtype profile.
13. Specify any arbitrary number for IDNum in Line 97. For example: Note: The fitted model parameters are stored in the structure variable OptimizationResults and as MATLAB-readable file OptimizationResults_#.mat in the Data/OptimizationResults folder. Here, number 1 is arbitrarily chosen as the IDNum and OptimizationResults_1.mat will be the file name.
Note: After step 14, users may decide whether to run multiple MATLAB sessions in parallel to reduce the time needed to obtain sufficient models for each sample. The memoryintensive optimization process will need approximately 8 GB RAM per MATLAB session.  a. Graphs in Figure 5 allow users to inspect the fitted model parameters (transition probabilities, or TPs) and assess the fitting quality, similarly performed in Liang et al. 1 b. Figure 6 allows users to inspect which glycan intermediates are theoretically important to generate the experimentally observed glycoprofiles (if the models are well-fitted). c. Figure 7 allows advanced users to compare additional network parameters by reaction types of the fitted Markov models. 19. Computed model features and glycoprofiles are now generated from the fitted Markov models and added as new subfields in the structure variable OptimizationResults.
Note: Please refer to the comments in the script for additional information regarding the subfields. The new OptimizationResults variable is stored as ProcessedModels.mat in the Data folder.  . Fitted RMSE represents the root mean squared error between the fitted average and the experimentally measured signals, whereas the random RMSE represents the error between the average signals generated by multiple random models and the experimentally measured signals. Leakage represents the percentage of total signals that are not detected in the experimental measurement. Generally, well-fitted models should meet at least two criteria: Random RMSE should be at least 10 times bigger than the Fitted RMSE and the leakage should be smaller than 10%. The entirety of predicted data can be inspected in the variable Optimization.WT.Predata_noRes, where each column represents a m/z value (or numbering) and each row represents a fitted model.   Note: Figure 8 shows an example of the predicted EPO glycoprofile produced by the cell line with Mgat2/St3gal4/St3gal6 double knockouts.
Predict Enbrel glycoprofiles from a combinatorial GT knockout, based on fitted EPO glycoprofiles of single GT knockouts and Enbrel glycoprofile produced in wildtype cells

Timing: 12 h
This step predicts the Enbrel glycoprofiles of combinatorial GT knockouts de novo from the fitted EPO glycoprofiles of single GT knockouts.
27. Following the same step for data collection (before you begin, step 10), organize the Enbrel glycoprofile produced by the wildtype cell line (as a demonstration) as the Excel file DrugXData.xlsx.  Note: The fitted model parameters are stored in the structure variable OptimizationResults and as MATLAB-readable file OptimizationResults_#.mat in the Data/OptimizationResults folder. Here, number 1 is chosen and OptimizationResults_DrugXWT_1.mat will be the file name.

Click on and open
Step6B_Predic_glycoengineered_glycoprofiles_of_DrugX.m in the ''Current Folder'' panel in MATLAB. 37. Starting from Line 13, specify EPO glycoprofiles with the desired combinations of isozyme knockouts that will be predicted, along with the name of the wildtype EPO glycoprofile.
Note: The demonstration specifies two predictions of EPO glycoprofiles from cell lines with B3gnt2 single knockout or with B3gnt2/St3gal3/St3gal4/St3gal6 quadruple knockouts.
38. Run the script, and three graphs will be generated for each predicted glycoprofile.
CRITICAL: Figure 10 shows an example of the predicted Enbrel glycoprofile produced by cell lines with B3gnt2/St3gal3/St3gal4/St3gal6 quadruple knockouts.

EXPECTED OUTCOMES
GlycoMME is a simple-to-use yet powerful toolkit that allows users to interrogate theoretical perturbations to N-glycosylation biosynthesis caused by glycoengineering efforts by inferring the glycosylation processes from glycomics data. Specifically, GlycoMME is a low-parameter, biologically interpretable modeling framework that quantifies the impact of knocking out specific glycosyltransferase isozymes on the glycoprofiles of glycoproteins by modeling glycosylation as a Markov process. 1,16,20 The framework does not rely on any manual parameter optimization or approximation of enzyme kinetics, and has demonstrated its usefulness in facilitating rational glycoengineering by predicting glycoprofiles de novo. 1 Here, we demonstrated that the framework can visualize glycomic data of various resolutions (with or without glycoform annotations). The fitted results shed light on the complex interactions between glycosyltransferases and their specificities toward different N-glycan epitopes. 1 GlycoMME can further use the learned impact of the GT isozymes for de novo prediction of glycoprofiles produced from cells with complex GT isozyme knockout genotypes, even for different recombinant glycoproteins. 1 While shown here for the analysis of glycoengineered proteins, this framework could be used to investigate a wide range of questions, including the identification of the genetic basis of congenital disorders of glycosylation, differential regulation of glycosylation in diseases such as cancer, and a variety of other biological questions involving changes in glycan structures. [23][24][25][26] Given that the model parameters are associated with specific glycosyltransferase reactions, GlycoMME has the potential to generate hypotheses directly testable by other omics data, such as proteomics and transcriptomics. The integration of glycomics with other omics data in the framework of GlycoMME will help us decipher the complex regulatory machinery of N-glycosylation and make rational glycoengineering a near possibility.

LIMITATIONS
While GlycoMME can accurately reproduce a variety of glycoengineered glycoprofiles, the real biological system of N-glycosylation is more complicated than the Markov model in its current form. Evidently, fitted models are in general more accurate than predicted models, and some fitted glycoprofiles are more accurate than others. Therefore, errors may be expected when predicted glycoprofiles are generated under different conditions from those of the fitted glycoprofiles, such as variations in cell types, media/supplement, and other genetic manipulations that may indirectly influence GT activities in N-glycosylation. More research is needed to improve model parametrization and make GlycoMME more versatile in considering glycoengineering impact beyond GT knockouts.

Potential solution
The error is caused by inability to reconstruct the models from the provided model parameters. First, the user should ensure that the variable OptimizationResults.Name.xval is not empty. Users may check directly by typing ''isempty(OptimizationResults.(ProfSel{a}).xval)'' in the command window. If the variable is indeed empty, change the value of Method (line 48) from '''KernalDensity'' to ''Outlier'' for the function FilterOptimizationResults and rerun the entire script. If the issue is still not resolved, obtain additional models by rerunning the fitting steps for the specific profile from steps 9-19.

Problem 3
At steps 18 or 35, the fitting quality is not satisfactory, such that the leakage is greater than 0.15 and/ or the ratio of random RMSE/fitted RMSE is less than 5 ( Figure 5B).

Potential solution
Double check the grammar of the Linear Code and the composition strings for this profile in the ''Annotation'' or ''MS Raw'' sheets of the source data file (e.g., Data_user.xlsx), especially for signals with significant errors between the experimental and the fitted profiles. Upon confirmation, obtain additional models (>6 models) by rerunning the fitting steps for the specific profile from steps 9-19.

Potential solution
If annotation exists for these signals, recheck the corresponding glycoforms of their annotations (Figure 1B). Upon confirmation, change their values from 1 to 0 in the ''Annotation'' sheet of the source data file (e.g., Data_user.xlsx). Rerun steps 1-8. Next, type ''open SetUpFittingProblem'' in the command window and change the value for ''MaxTime'' from 3600 to 7200 at line 70 of the opened script. Save the modified script. Finally, obtain additional models (>6 models) by rerunning the fitting steps for the specific profile from steps 9-19.

Potential solution
This error occurs when the enzyme names used for knockout annotations are inconsistent between the single-knockout models and the predicted multiple-knockout models. Ensure to only use alphanumerical characters for knockout profile names (e.g., the knocked-out enzymes' names) without space. Upon confirmation, ensure the variables KnockoutSel and BaseProfSel modified at step 25 are consistent with the names of the fitted profiles.

RESOURCE AVAILABILITY
Lead contact Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Nathan E. Lewis, (nlewisres@ucsd.edu).

Materials availability
This study did not generate new unique reagents.

Data and code availability
The EPO datasets were obtained from a previous study 22 and processed to be used as a part of the example dataset. The processed example datasets and the pipeline codes are available at https:// github.com/LewisLabUCSD/N-Glycosylation-Markov-Models/tree/main (https://doi.org/10.5281/ zenodo.7742912). The complete Enbrel dataset supporting the current study has not been deposited in a public repository because of ongoing investigation of the dataset but is available from the corresponding author on request.